140 ◾ Bioinformatics
dataset from validated data resources, such as 1000 Genomes, Omni, and HapMap, and
then it uses the model to filter out putative artifacts from the called variants. Applying
the model assigns each variant a log-odds ratio score (VQSLOD) that measures how likely
the variant is to be real, given the data used in training. The VQSLOD is added to the
INFO field of the variant, and the variants are then filtered against a VQSLOD threshold.
SNPs and InDels are recalibrated separately. Variant recalibration and filtering are
performed in two steps:
(i) Building of the recalibration model:
The recalibration model is built using the VariantRecalibrator tool. The inputs to this
tool are the variants to be recalibrated (“-V”) and the known training datasets
(“--resource”). The latter must be downloaded from a reliable source such as the GATK
resource bundle. The fitted model estimates the relationship between the probability that
a variant is true rather than an artifact and continuous covariates such as QD
(QualByDepth, quality by depth), MQ (mapping quality), and FS (FisherStrand). The VQSLOD
is the log odds, under the Gaussian mixture model, that a variant is true versus false.
Each variant in the input VCF file is assigned a VQSLOD, recorded in the INFO field of
the output file, and the variants are ranked by VQSLOD. A truth sensitivity threshold
(tranche) can be provided with “-tranche” as a percentage, and several thresholds can be
set by repeating the option. The outputs of this step are a recalibration file (in VCF
format) and other files, including the tranches file, both of which will be used by
ApplyVQSR, and plot files.
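To make the idea of a tranche more concrete, the following is a minimal sketch, not GATK code, of how a truth sensitivity percentage can be turned into a VQSLOD cutoff. It assumes a short list of invented VQSLOD scores at truth sites; the helper names get_vqslod and vqslod_cutoff are hypothetical and exist only for this illustration. The cutoff is the lowest score that must still be accepted so that at least the requested percentage of truth sites is retained.

```shell
# Toy illustration only: the scores below are invented, not GATK output.
truth_scores="9.1 7.4 6.8 5.2 4.9 3.3 2.0 0.7 -1.5 -4.2"

# get_vqslod INFO_STRING
# Prints the value of the VQSLOD key from a semicolon-separated INFO string.
get_vqslod() {
    printf '%s\n' "$1" | tr ';' '\n' | awk -F= '$1 == "VQSLOD" { print $2 }'
}

# vqslod_cutoff SENSITIVITY_PERCENT
# Sorts the truth-site scores from best to worst and prints the score of the
# last site that must be kept to reach the requested sensitivity.
vqslod_cutoff() {
    # Word splitting of $truth_scores is intentional: one score per line.
    printf '%s\n' $truth_scores | LC_ALL=C sort -rn | awk -v t="$1" '
        { score[NR] = $1 }
        END {
            need = int(NR * t / 100)            # truth sites to retain
            if (need < NR * t / 100) need++     # round up to a whole site
            if (need < 1) need = 1
            print score[need]
        }'
}

get_vqslod "AC=2;AF=0.5;QD=28.1;VQSLOD=4.9;culprit=MQ"   # prints 4.9
vqslod_cutoff 90     # prints -1.5 (keeps 9 of the 10 truth sites)
vqslod_cutoff 100    # prints -4.2 (keeps all 10 truth sites)
```

In this sketch a higher sensitivity percentage gives a lower, more permissive cutoff: more true variants are retained, at the cost of admitting more putative artifacts.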
cd refvcf
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx
cd ..
mkdir VQSR
cd vcf
~/software/gatk-4.2.3.0/gatk --java-options "-Xmx10g" VariantRecalibrator \
-R ../refgenome/Homo_sapiens_assembly38.fasta \
-V allsamplesSNP_chr21.vcf \
--trust-all-polymorphic \
-tranche 100.0 \
-tranche 99.95 \
-tranche 99.90 \
-tranche 99.85 \
-tranche 99.80 \
-tranche 99.00 \
-tranche 98.00 \
-tranche 97.00 \
-tranche 90.00 \
--max-gaussians 6 \
--resource:1000G,known=false,training=true,truth=true,prior=10.0 \